## Installation

```bash
npx skills add https://github.com/daffy0208/ai-dev-standards --skill 'RAG Implementer'
```
# RAG Implementer

Build production-ready retrieval-augmented generation systems.

## Core Principle

**RAG = Retrieval + Context Assembly + Generation**

Use RAG when you need LLMs to access fresh, domain-specific, or proprietary knowledge that wasn't in their training data.
## ⚠️ Prerequisites & Cost Reality Check

### STOP: Have You Validated the Need for RAG?

Before implementing RAG, confirm:

- [ ] **Problem validated** - Completed `product-strategist` Phase 1 (problem discovery)
- [ ] **Users need AI search** - Tested with simpler alternatives (see below)
- [ ] **ROI justified** - Calculated the cost vs. benefit of RAG compared to alternatives
### Try These FIRST (Before RAG)

RAG is powerful but expensive. Try cheaper alternatives first:

**1. FAQ Page / Documentation (1 day, $0)**

- Create a well-organized FAQ or docs page
- Add search with Cmd+F
- **Works for:** <50 common questions, static content
- **Test:** Do users find answers? If yes, stop here.

**2. Simple Keyword Search (2-3 days, $0-20/month)**

- Use Algolia, Typesense, or PostgreSQL full-text search
- Good enough for 80% of use cases
- **Works for:** <100k documents, keyword matching sufficient
- **Test:** Do users get relevant results? If yes, stop here.

**3. Manual Curation (Concierge MVP) (1 week, $0)**

- Manually answer user questions
- Build an FAQ from the common questions
- **Works for:** <100 users, validating whether users want AI at all
- **Test:** Do users value your answers enough to pay? If yes, consider RAG.

**4. Simple Semantic Search (1 week, $30-50/month)**

- Use OpenAI embeddings + Postgres pgvector (see the sketch below)
- Skip complex retrieval, re-ranking, etc.
- **Works for:** <50k documents, basic semantic search
- **Test:** Are embeddings better than keyword search? If no, stop here.
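If option 4 is where you land, here is a minimal sketch of what "OpenAI embeddings + Postgres pgvector" can look like. It assumes a hypothetical `docs` table with a `vector(1536)` embedding column, the `text-embedding-3-small` model, and a local connection string; all of those are placeholders to adapt.

```python
# Minimal semantic search: embed the query, rank rows by cosine distance in pgvector.
from openai import OpenAI
import psycopg

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> str:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    # pgvector accepts vectors as a '[v1,v2,...]' literal
    return "[" + ",".join(str(v) for v in resp.data[0].embedding) + "]"

def search(query: str, top_k: int = 5) -> list[tuple[str, float]]:
    vec = embed(query)
    with psycopg.connect("postgresql://localhost/ragdb") as conn:  # placeholder DSN
        return conn.execute(
            """
            SELECT content, 1 - (embedding <=> %s::vector) AS similarity
            FROM docs
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, vec, top_k),
        ).fetchall()
```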
### Cost Reality Check

**Naive RAG (Prototype)**

- **Time:** 1-2 weeks
- **Cost:** $50-150/month (vector DB + embeddings + API calls)
- **When:** Prototype, <10k documents, proof of concept

**Advanced RAG (Production)**

- **Time:** 3-4 weeks
- **Cost:** $200-500/month (hybrid search, re-ranking, monitoring)
- **When:** Production, 10k-1M documents, validated demand

**Modular RAG (Enterprise)**

- **Time:** 6-8 weeks
- **Cost:** $500-2000+/month (multiple KBs, specialized modules)
- **When:** Enterprise, 1M+ documents, mission-critical
### Decision Tree: Do You Really Need RAG?

```
Do users need to search your content?
│
├─ No → Don't build RAG ❌
│
└─ Yes
   ├─ <50 items? → FAQ page ✅ ($0)
   │
   └─ >50 items?
      ├─ Keyword search enough? → Use Algolia ✅ ($0-20/mo)
      │
      └─ Need semantic understanding?
         ├─ <50k docs? → Simple semantic (pgvector) ✅ ($30/mo)
         │
         └─ >50k docs?
            ├─ Validated with users? → Build RAG ✅
            └─ Not validated? → Test with Concierge MVP first ⚠️
```
### Validation Checklist

Only proceed with RAG implementation if:

- [ ] Tested simpler alternatives (FAQ, keyword search, manual curation)
- [ ] Users confirmed they need AI-powered search (not just because you think they do)
- [ ] Calculated ROI: cost of RAG < value users get
- [ ] Have >50k documents OR complex semantic search requirements
- [ ] Budget: $200-500/month for infrastructure
- [ ] Time: 3-4 weeks for production implementation

**If any checkbox is unchecked:** Go back to the `product-strategist` or `mvp-builder` skills to validate first.

**See also:** `PLAYBOOKS/validation-first-development.md` for the step-by-step validation process.
## 8-Phase RAG Implementation

### Phase 1: Knowledge Base Design

**Goal:** Create a well-structured knowledge foundation

**Actions:**

- Map data sources (internal: docs, databases, APIs / external: web, feeds)
- Filter noise, select authoritative content (prevent the "data dump fallacy")
- Define chunking strategy: semantic chunking based on document structure
- Add metadata: tags, timestamps, source identifiers, categories

**Validation:**

- All data sources catalogued and prioritized
- Data quality assessed (accuracy, completeness, freshness)
- Chunking strategy tested with sample documents
- Metadata schema validated for search effectiveness

**Common Chunking Strategies** (a sketch follows this list):

- Fixed-size: 500-1000 tokens, 50-100 token overlap
- Semantic: by paragraph, section headers, or topic boundaries
- Recursive: split by structure (markdown headers, code blocks)
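To make the fixed-size strategy concrete, here is a minimal token-based chunker with overlap. It assumes `tiktoken` for tokenization; the 800/80 values are just starting points inside the ranges listed above.

```python
# Fixed-size chunking with overlap, counted in tokens rather than characters.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 800, overlap_tokens: int = 80) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # common OpenAI tokenizer
    tokens = enc.encode(text)
    chunks: list[str] = []
    step = chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```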
### Phase 2: Embedding Strategy

**Goal:** Choose the optimal embedding approach for semantic understanding

**Actions:**

- Select an embedding model: `text-embedding-3-large` (3072 dim) for general use, a domain-specific model for specialized content
- Plan multi-modal needs (text, code, images, tables)
- Decide on fine-tuning: use domain data if general embeddings underperform
- Establish similarity benchmarks

**Validation:**

- Embedding model benchmarked on domain data (a comparison sketch follows the model list below)
- Retrieval accuracy tested with known query-document pairs
- Storage and compute costs validated

**Model Selection:**

- General: OpenAI `text-embedding-3-large`, `text-embedding-3-small`
- Code: `code-search-babbage-code-001` or StarEncoder
- Multilingual: `multilingual-e5-large`
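As a hedged illustration of the benchmarking step, the sketch below scores known query-document pairs by cosine similarity so two candidate embedding models can be compared on the same data; the example pair and model name are placeholders.

```python
# Compare embedding models on known query-document pairs via cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str) -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def mean_pair_similarity(pairs: list[tuple[str, str]], model: str) -> float:
    queries = embed_batch([q for q, _ in pairs], model)
    docs = embed_batch([d for _, d in pairs], model)
    # Cosine similarity of each query against its known-relevant document.
    sims = (queries * docs).sum(axis=1) / (
        np.linalg.norm(queries, axis=1) * np.linalg.norm(docs, axis=1)
    )
    return float(sims.mean())

# Hypothetical usage: higher mean similarity on your own pairs is a rough signal;
# also confirm that irrelevant documents score lower (see Phase 6 metrics).
pairs = [("how do I reset my password?", "To reset your password, open Settings...")]
print(mean_pair_similarity(pairs, "text-embedding-3-small"))
```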
### Phase 3: Vector Store Architecture

**Goal:** Implement a scalable vector database

**Actions:**

- Choose a vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
- Configure the index: HNSW for speed, IVF for scale (see the sketch below)
- Plan scalability: data growth and query volume
- Implement backup, recovery, and security

**Validation:**

- Vector store benchmarked under expected load
- Index optimized for retrieval speed and accuracy
- Backup and recovery tested
- Security controls implemented

**Vector DB Decision:**

- Managed cloud → Pinecone
- Self-hosted, feature-rich → Weaviate
- Lightweight, local → Chroma
- Cost-conscious → pgvector (Postgres extension)
- High-performance → Qdrant
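If pgvector is the choice, index configuration is one DDL statement. The sketch below (assuming pgvector 0.5+ and the same hypothetical `docs` table as earlier) creates an HNSW index, with the IVFFlat alternative shown as a comment; the parameter values are common starting points, not benchmarked recommendations.

```python
# Create an ANN index on the embedding column: HNSW for recall/speed, IVFFlat for scale.
import psycopg

with psycopg.connect("postgresql://localhost/ragdb") as conn:  # placeholder DSN
    conn.execute(
        """
        CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
        ON docs USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
        """
    )
    # IVFFlat alternative: build it after the table is populated, since the list
    # centroids are computed from existing rows.
    # CREATE INDEX docs_embedding_ivf
    #   ON docs USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```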
### Phase 4: Retrieval Pipeline

**Goal:** Build sophisticated retrieval beyond simple similarity search

**Actions:**

- Implement hybrid retrieval: semantic search + keyword (BM25); a fusion sketch appears at the end of this phase
- Add query enhancement: expansion, reformulation, multi-query
- Apply contextual filtering: metadata, temporal constraints, relevance ranking
- Design for query types: factual (precision), analytical (breadth), creative (diversity)
- Handle edge cases: no relevant results found

**Advanced Techniques:**

- **Re-ranking:** Use a cross-encoder after initial retrieval (e.g., `cross-encoder/ms-marco-MiniLM-L-12-v2`); see the sketch after this list
- **Query routing:** Route different query types to specialized strategies
- **Ensemble methods:** Combine multiple retrieval approaches
- **Adaptive retrieval:** Adjust top-k based on query complexity
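A minimal re-ranking sketch using the cross-encoder named above through the `sentence-transformers` library; the candidates are assumed to come from whatever first-stage retriever you use.

```python
# Re-rank first-stage candidates with a cross-encoder (slower but more accurate than
# bi-encoder similarity, so apply it only to the top ~20-100 candidates).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```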
**Validation:**

- Retrieval accuracy tested across diverse query types
- Hybrid retrieval outperforms single-method baselines
- Query latency meets requirements (<500ms ideal)
- Edge cases and fallbacks tested
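To illustrate the hybrid-retrieval action, here is a sketch that fuses a BM25 keyword ranking (via the `rank_bm25` package) with a semantic ranking using reciprocal rank fusion, which sidesteps normalizing two incompatible score scales. The semantic ranking is assumed to come from the pgvector query sketched earlier, or any retriever that returns documents in ranked order.

```python
# Hybrid retrieval: fuse BM25 and semantic rankings with Reciprocal Rank Fusion (RRF).
from rank_bm25 import BM25Okapi

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    # Each document earns 1 / (k + rank) from every ranking it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def hybrid_search(query: str, corpus: list[str], semantic_ranking: list[str]) -> list[str]:
    bm25 = BM25Okapi([doc.split() for doc in corpus])  # naive whitespace tokenization
    bm25_scores = bm25.get_scores(query.split())
    keyword_ranking = [doc for doc, _ in
                       sorted(zip(corpus, bm25_scores), key=lambda p: p[1], reverse=True)]
    return rrf_fuse([keyword_ranking, semantic_ranking])
```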
### Phase 5: Context Assembly

**Goal:** Transform retrieved chunks into optimal LLM context

**Actions:**

- Rank and select: prioritize by relevance score, recency, source authority
- Synthesize: merge related chunks, avoid redundancy
- Compress: use LLMLingua or similar for token optimization
- Mitigate "lost in the middle": place critical info at the start and end (a reordering sketch appears at the end of this phase)
- Adapt dynamically: adjust context based on conversation history

**Context Engineering Integration:**

- Blend RAG results with system instructions and user prompts
- Maintain conversation coherence across multi-turn interactions
- Implement context persistence for follow-up queries
- Balance context size vs. information density

**Validation:**

- Context relevance validated against human judgments
- Token optimization maintains accuracy
- Multi-turn conversations maintain coherence
- Assembly latency <200ms
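One way to act on the "lost in the middle" point is to interleave chunks so the strongest evidence lands at the beginning and end of the assembled context. A minimal sketch, assuming the chunks arrive already sorted by descending relevance:

```python
# Reorder relevance-sorted chunks so the best evidence sits at the start and end of
# the context, the positions LLMs attend to most reliably.
def reorder_for_context(chunks_by_relevance: list[str]) -> list[str]:
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: best chunk first, second-best last, third near the front, ...
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: ["A", "B", "C", "D", "E"] (A most relevant) -> ["A", "C", "E", "D", "B"]
```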
### Phase 6: Evaluation & Metrics

**Goal:** Measure RAG system performance comprehensively

**Retrieval Quality** (a computation sketch follows this list):

- **Precision@K:** Fraction of the top-K results that are relevant
- **Recall@K:** Fraction of all relevant documents that appear in the top-K
- **MRR (Mean Reciprocal Rank):** Average of 1/rank of the first relevant result
- **NDCG:** Ranking quality with graded relevance
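These retrieval metrics are simple enough to compute without an evaluation framework. A sketch over hypothetical (retrieved IDs, relevant IDs) pairs:

```python
# Retrieval-quality metrics over retrieved/relevant document-ID sets.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc in retrieved[:k] if doc in relevant) / max(len(relevant), 1)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    # Mean of 1/rank of the first relevant result per query (0 if none retrieved).
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```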
**Generation Quality:**

- **Faithfulness:** Accuracy of generated content against the sources
- **Answer Relevance:** Relevance of the response to the query
- **Context Utilization:** How effectively the LLM uses retrieved information
- **Hallucination Rate:** Frequency of unsupported claims

**System Performance:**

- **End-to-End Latency:** Query to answer (<3 seconds target)
- **Retrieval Latency:** Time to retrieve and rank (<500ms)
- **Token Efficiency:** Information density per token
- **Cost Per Query:** Combined retrieval + generation costs

**Validation:**

- Baseline metrics established
- A/B testing framework for config comparisons
- Automated evaluation pipeline deployed
- Human evaluation protocols for ground truth
### Phase 7: Production Deployment

**Goal:** Deploy with enterprise-grade reliability and security

**Deployment:**

- Containerize with Docker/Kubernetes
- Implement load balancing across RAG instances
- Add caching for frequent queries (see the sketch below)
- Graceful degradation: fall back to the base model on component failure
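For the caching item, a small in-process TTL cache keyed on the normalized query often covers frequent-query traffic before a shared cache like Redis is needed. Everything in the sketch (the TTL, the normalization, the `answer_query` callable it wraps) is an assumption for illustration.

```python
# Minimal TTL cache for frequent queries, keyed on a normalized query string.
import hashlib
import time

_CACHE: dict[str, tuple[float, dict]] = {}
TTL_SECONDS = 15 * 60  # assumed freshness window

def cached_answer(query: str, answer_query) -> dict:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: skip retrieval and generation entirely
    result = answer_query(query)  # full retrieve -> assemble -> generate pipeline
    _CACHE[key] = (time.time(), result)
    return result
```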
**Security:**

- Role-based access controls for the knowledge base
- Data masking and PII protection
- Audit logging for compliance
- Prompt injection defense

**Monitoring:**

- Real-time metrics dashboard (latency, cost, accuracy)
- Query analysis for patterns and failure modes
- Cost tracking and optimization alerts
- Performance profiling for bottlenecks

**Validation:**

- Production handles expected traffic
- Security prevents unauthorized access
- Monitoring provides actionable insights
- Incident response procedures tested
### Phase 8: Continuous Improvement

**Goal:** Establish processes for ongoing enhancement

**Data Pipeline:**

- Automated knowledge base updates (real-time or scheduled)
- Quality monitoring: detect data drift and degradation
- Source diversification: add new data sources
- Feedback integration: user corrections and preferences

**Model Evolution:**

- Evaluate and migrate to improved embeddings
- Fine-tune on domain data regularly
- Upgrade architecture: Naive → Advanced → Modular RAG
- Expand multi-modal support (images, audio, video)

**Optimization:**

- Analyze query patterns, optimize for common needs
- Improve cache hit rates
- Tune vector indices regularly
- Balance performance vs. costs

**Validation:**

- Automated improvement pipelines functioning
- Performance trends show improvement
- User satisfaction increasing
- System adapts to changing needs
## Key RAG Principles

### 1. Relevance Over Volume

- Quality curation > massive datasets
- Remove outdated/low-quality content continuously
- Prioritize the most relevant information to prevent "lost in the middle"

### 2. Semantic Understanding

- Use embeddings for true semantic matching, not just keywords
- Recognize query intent (factual, analytical, creative)
- Adapt retrieval strategy based on context

### 3. Multi-Modal Intelligence

- Handle text, images, code, tables, structured data
- Enable cross-modal retrieval (text query → image results)
- Preserve document structure and formatting

### 4. Temporal Awareness

- Prioritize recent information for time-sensitive topics
- Maintain historical access when relevant
- Integrate real-time data feeds for dynamic domains

### 5. Transparency & Trust

- Always provide source citations
- Indicate confidence levels
- Explain why specific information was selected
## Standard RAG Response Format

```json
{
  "answer": "Generated response incorporating retrieved information",
  "sources": [
    {
      "content": "Retrieved text chunk",
      "source": "Document/URL identifier",
      "relevance_score": 0.95,
      "chunk_id": "unique_identifier"
    }
  ],
  "confidence": 0.87,
  "retrieval_metadata": {
    "chunks_retrieved": 5,
    "retrieval_time_ms": 150,
    "generation_time_ms": 800
  }
}
```
## Critical Success Rules

**Non-Negotiable:**

- ✅ Source attribution for every response
- ✅ Validate generated content against sources (prevent hallucination)
- ✅ Filter sensitive data before retrieval
- ✅ Respond within latency thresholds (<3 seconds)
- ✅ Monitor and optimize costs continuously
- ✅ Comply with security policies
- ✅ Graceful degradation on failures
- ✅ Comprehensive testing before production

**Quality Gates:**

- Before Production: >85% accuracy on the evaluation dataset
- Ongoing: User satisfaction >4.0/5.0
- Performance: 95th percentile latency <5 seconds
- Reliability: 99.5% uptime
- Cost: Within 10% of budget
## Advanced Patterns

### Modular RAG Architecture

- **Search Module:** Query understanding and reformulation
- **Memory Module:** Long-term conversation persistence
- **Routing Module:** Query routing to specialized knowledge bases (see the sketch below)
- **Predict Module:** Anticipatory pre-loading based on context
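A routing module can start as a single LLM call that picks a knowledge base by name. The sketch below assumes the OpenAI chat API, a small chat model, and three hypothetical knowledge-base names.

```python
# Minimal routing module: have an LLM classify which knowledge base should serve the query.
from openai import OpenAI

client = OpenAI()
KNOWLEDGE_BASES = ["product_docs", "engineering_wiki", "support_tickets"]  # hypothetical

def route_query(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed small, cheap chat model
        messages=[
            {"role": "system",
             "content": "Reply with exactly one of: " + ", ".join(KNOWLEDGE_BASES)},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    choice = resp.choices[0].message.content.strip()
    return choice if choice in KNOWLEDGE_BASES else KNOWLEDGE_BASES[0]  # safe fallback
```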
### Hybrid RAG + Fine-tuning

- RAG for dynamic, frequently changing knowledge
- Fine-tuning for domain-specific reasoning patterns
- Combine the strengths of both for maximum effectiveness
## Related Resources